feat: voice-activity streaming mode & inner-vad for speech-to-text module by IgorSwat · Pull Request #1160 · software-mansion/react-native-executorch

IgorSwat · 2026-05-20T13:09:00Z

Description

This PR introduces changes focused on the voice-activity-detection (VAD) module and its use within the library:

Native-side VAD streaming - introduces a continuous voice-activity-detection mechanism with a user-friendly callback system. Example usage from the demo app:

  await model.stream({
    onSpeechBegin: () => {...},
    onSpeechEnd: () => {...},
    options: {...},
  });

VAD x STT integration - adds an option to use voice-activity detection within the speech-to-text module, significantly improving the effective performance of the STT.
Demo apps - introduces a new screen in the speech demo app, VoiceActivityDetectionScreen, and changes the behavior of SpeechToTextScreen by adding a toggle that switches the VAD submodule for STT on or off.

Introduces a breaking change?

Yes
No

Type of change

Bug fix (change which fixes an issue)
New feature (change which adds functionality)
Documentation update (improves or adds clarity to existing documentation)
Other (chores, tests, code style improvements etc.)

Tested on

iOS
Android

Testing instructions

To test VAD streaming: run the VoiceActivityDetectionScreen in the Speech demo app.
To test the VAD & STT integration: run the SpeechToTextScreen in the Speech demo app with the VAD toggle on.

Screenshots

Related issues

#1118

Checklist

I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have updated the documentation accordingly
My changes generate no new warnings

Additional notes

msluszniak · 2026-05-21T15:39:26Z

+      }
+    })();
+
+    while (this.isStreaming && !finished) {


stream() resolves as soon as this.isStreaming flips, but the native loop only re-checks the flag at the top of the next iteration — so for up to timeout + one inference after await streamStop() returns, the native streamer is still alive, can still queue callInvoker_->invokeAsync callbacks, and still touches audioBuffer_. If the caller then runs unload() (or the host object is destroyed) we're in UAF / use-after-unload territory.

Two options: (a) actually join — stream() doesn't resolve until the native stream() call returns, and streamStop() awaits that; or (b) document explicitly that unload() is not safe immediately after streamStop() and that callbacks may fire after the promise resolves. (a) is the safer contract.
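
Option (a) could look roughly like the sketch below. It is not the PR's implementation: `NativeStreamer` and `JoinedStreamer` are hypothetical names, and the sketch assumes the native `stream()` promise settles only once the native loop has actually exited.

```typescript
// Hypothetical stand-in for the native binding (an assumption, not the real API).
interface NativeStreamer {
  stream(): Promise<void>; // assumed to settle only after the native loop has exited
  streamStop(): void; // asks the native loop to exit
}

class JoinedStreamer {
  private readonly native: NativeStreamer;
  private running: Promise<void> | null = null;

  constructor(native: NativeStreamer) {
    this.native = native;
  }

  // Resolves only when the native stream() call has returned.
  async stream(): Promise<void> {
    if (this.running !== null) {
      throw new Error('StreamingInProgress');
    }
    this.running = this.native.stream();
    try {
      await this.running;
    } finally {
      this.running = null;
    }
  }

  // Requests a stop, then joins on the in-flight native call.
  async streamStop(): Promise<void> {
    const running = this.running;
    if (running === null) {
      return;
    }
    this.native.streamStop();
    // Errors are reported to the stream() caller; here we only wait for the exit.
    await running.catch(() => undefined);
  }
}
```

Under that assumption, a caller that awaits `streamStop()` before calling `unload()` no longer races with the native loop.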

chmjkb

besides my previous comment, I think it looks good, great work!

msluszniak

Review from local verification (VAD native tests pass 15/15, demo app boots). A few correctness items and minor cleanups inline.

The size check on audioBuffer_ raced with streamInsert writes under audioBufferMutex_. Move both the size comparison and the erase under a single lock so the read isn't concurrent with vector mutation.

generate() ran unlocked against a std::span pointing into audioBuffer_, relying on the vector's reservation never being exceeded. Unbounded streamInsert from JS could grow the buffer past capacity, trigger reallocation, and invalidate the span. Take a local copy under the lock instead so the inference operates on stable data.

Previously `lastMerged.end = current.end` would shrink the merged segment if a non-monotonic input arrived (current.end < lastMerged.end). postprocess() doesn't produce such input today, but the safer form removes the hidden invariant.
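
The safer form can be sketched as follows. This is a TypeScript illustration of the logic only (the actual change is in the native code); the `Segment` shape, the `maxGap` parameter, and this `mergeSegments` signature are assumptions made for the example.

```typescript
// Hypothetical segment shape; field names mirror the `start`/`end` used in the commit message.
interface Segment {
  start: number;
  end: number;
}

// Merges segments whose gap to the previous merged segment is at most maxGap.
// Assumes the input is sorted by start.
function mergeSegments(segments: readonly Segment[], maxGap: number): Segment[] {
  const merged: Segment[] = [];
  for (const current of segments) {
    const lastMerged = merged[merged.length - 1];
    if (lastMerged !== undefined && current.start - lastMerged.end <= maxGap) {
      // Taking the max means a shorter, fully contained segment cannot shrink the result.
      lastMerged.end = Math.max(lastMerged.end, current.end);
    } else {
      merged.push({ start: current.start, end: current.end });
    }
  }
  return merged;
}
```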

isStreaming stayed true after the native stream() resolved (whether normally or via error), so subsequent code relying on the flag saw stale state. Reset it in a finally block alongside the wake/finished bookkeeping.
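
A minimal sketch of the pattern, assuming a hypothetical `StreamState` wrapper and a `nativeStream` callback standing in for the native call; the real class also handles the wake/finished bookkeeping, which is omitted here.

```typescript
class StreamState {
  isStreaming = false;

  // nativeStream is a hypothetical stand-in for the native stream() call.
  async run(nativeStream: () => Promise<void>): Promise<void> {
    this.isStreaming = true;
    try {
      await nativeStream();
    } finally {
      // Runs on both normal completion and error, so the flag never goes stale.
      this.isStreaming = false;
    }
  }
}
```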

Mutating the in-flight `options` to flip `useVAD` off before the final `finish()` call worked but left a footgun for anything later that reads back `options.useVAD`. Build a local copy with the override instead.
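
For illustration, the override can be built with an object spread. `StreamOptions`, its `language` field, and `optionsForFinish` are hypothetical names; only `useVAD` comes from the commit message.

```typescript
// Hypothetical options shape; only useVAD is taken from the PR.
interface StreamOptions {
  useVAD: boolean;
  language?: string;
}

// Returns a shallow copy with VAD disabled, leaving the caller's object untouched.
function optionsForFinish(options: Readonly<StreamOptions>): StreamOptions {
  return { ...options, useVAD: false };
}
```

Anything that later reads back the original `options.useVAD` still sees the value the caller passed in.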

The function only reads from the span. Tagging it const signals intent and matches the equivalent OnlineASR::insertAudioChunk signature on the STT side.

`||` coerced an explicit `0` to the default 500. Switch to `??` so callers can pass 0 to disable the margin.
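
The difference is easy to reproduce in isolation. The constant and function names below are made up for the example; only the default of 500 comes from the commit message.

```typescript
const DEFAULT_DETECTION_MARGIN = 500;

// Old behavior: any falsy value, including an explicit 0, falls back to the default.
function marginWithOr(margin?: number): number {
  return margin || DEFAULT_DETECTION_MARGIN;
}

// New behavior: only null or undefined fall back to the default.
function marginWithNullish(margin?: number): number {
  return margin ?? DEFAULT_DETECTION_MARGIN;
}
```

Here `marginWithOr(0)` returns 500, while `marginWithNullish(0)` returns 0.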

Explain what the 1.2 multiplier means: it widens the VAD merge window relative to the user-configured detectionMargin so brief intra-utterance silences don't split a single utterance into separate segments.

`OnlineASR::process` computes the silence-trim cut as a `size_t` subtraction of these two constants. If either is tweaked such that the ordering inverts, the subtraction wraps and the subsequent `erase` reads past the buffer. Lock the invariant in at compile time.

SpeechToText gained a 4th positional `vadSource` argument; pass an empty string at all 9 existing call sites so the test still exercises the no-VAD path. Add the new VAD sources to the CMake target so the binary links.

mergeSegments: empty input, single-segment passthrough, distant segments stay separate, close/adjacent segments merge, overlapping shorter inner doesn't shrink the result, mixed sequence merges only adjacent close pairs.
stream/streamInsert/streamStop: stream() loop exits promptly on streamStop, streamInsert while streaming doesn't crash, concurrent stream() throws StreamingInProgress, and stream can be restarted after a stop.

Covers the new PR behavior:
- valid vadSource constructs without throwing
- invalid vadSource fails loudly
- one-shot transcribe() is unaffected when VAD is loaded
- stream(useVAD=true) on a model built without VAD throws
- stream(useVAD=true) over pure-silence audio drives the VAD branch of OnlineASR::process and exits cleanly via streamStop()

Also register fsmn-vad in run_tests.sh so the SpeechToTextTests runner pushes the VAD model alongside the Whisper artifacts.

Required by the const-span signature of VoiceActivityDetection::streamInsert. Without this, ModelHostObject's template instantiation of synchronousHostFunction<&VoiceActivityDetection::streamInsert> references an undefined symbol at link time on Android. Mirrors the existing std::span<float> specialization; the underlying getTypedArrayAsSpan<float>() helper returns a span over the same storage, which converts implicitly to span<const float>.

msluszniak

🚀

IgorSwat requested review from chmjkb and msluszniak May 20, 2026 13:09

IgorSwat force-pushed the @is/vad-streaming branch from 694fe4f to 1c2411e Compare May 20, 2026 13:15

IgorSwat changed the base branch from main to @is/speech-to-text-ultimate May 20, 2026 13:26

chmjkb requested changes May 20, 2026

msluszniak reviewed May 20, 2026

IgorSwat force-pushed the @is/speech-to-text-ultimate branch from 02113ff to 6bba141 Compare May 20, 2026 15:46

msluszniak reviewed May 20, 2026

Comment thread ...ve-executorch/common/rnexecutorch/models/voice_activity_detection/VoiceActivityDetection.cpp Outdated

chmjkb requested changes May 21, 2026

Comment thread apps/speech/screens/SpeechToTextScreen.tsx

Comment thread apps/speech/screens/VoiceActivityDetectionScreen.tsx

Base automatically changed from @is/speech-to-text-ultimate to main May 21, 2026 08:20

IgorSwat force-pushed the @is/vad-streaming branch from 1c2411e to 0ea858d Compare May 21, 2026 08:55

msluszniak assigned IgorSwat May 21, 2026

msluszniak added the feature PRs that implement a new feature label May 21, 2026

IgorSwat requested a review from benITo47 May 21, 2026 12:49

chmjkb reviewed May 21, 2026

Comment thread docs/docs/03-hooks/01-natural-language-processing/useSpeechToText.md Outdated

chmjkb reviewed May 21, 2026

Comment thread docs/docs/04-typescript-api/01-natural-language-processing/VADModule.md Outdated

benITo47 reviewed May 21, 2026

Comment thread ...ages/react-native-executorch/common/rnexecutorch/models/voice_activity_detection/Constants.h Outdated

benITo47 requested changes May 21, 2026
